Abstract

This paper discusses the problem of learn- 
ing language from unprocessed text and 
speech signals, concentrating on the prob- 
lem of learning a lexicon. In particular, it 
argues for a representation of language in 
which linguistic parameters like words are 
built by perturbing a composition of exist- 
ing parameters. The power of this repre- 
sentation is demonstrated by several exam- 
ples in text segmentation and compression, 
acquisition of a lexicon from raw speech, 
and the acquisition of mappings between 
text and artificial representations of mean- 
ing. 

1 Motivation 

Language is a robust and necessarily redundant 
communication mechanism. Its redundancies com- 
monly manifest themselves as predictable patterns 
in speech and text signals, and it is largely these 
patterns that enable text and speech compression. 
Naturally, many patterns in text and speech re- 
flect interesting properties of language. For ex- 
ample, the is both an unusually frequent sequence 
of letters and an English word. This suggests us- 
ing compression as a means of acquiring under- 
lying properties of language from surface signals. 
The general methodology of language-learning-by-compression
is not new. Some notable early proponents
included Chomsky (1955), Solomonoff (1960)
and Harris (1968), and compression has been used
as the basis for a wide variety of computer programs
(Olivier, 1968; Wolff, 1982; Ellison, 1992; Stolcke, 1994;
Chen, 1995; Cartwright and Brent, 1994), among others.

1.1 Patterns and Language 

Unfortunately, while surface patterns often reflect 
interesting linguistic mechanisms and parameters, 
they do not always do so. Three classes of exam- 
ples serve to illustrate this. 



1.1.1 Extralinguistic Patterns 

The sequence it was a dark and stormy night is 
a pattern in the sense it occurs in text far more 
often than the frequencies of its letters would sug- 
gest, but that does not make it a lexical or gram- 
matical primitive: it is the product of a complex 
mixture of linguistic and extra-linguistic processes. 
Such patterns can be indistinguishable from desired
ones. For example, in the Brown corpus (Francis and
Kucera, 1982) scratching her nose occurs 5 times, 
a corpus-specific idiosyncrasy. This phrase has the 
same structure as the idiom kicking the bucket. It is 
difficult to imagine any induction algorithm learn- 
ing kicking the bucket from this corpus without also 
(mistakenly) learning scratching her nose. 

1.1.2 The Definition of Interesting 

This discussion presumes there is a set of desired 
patterns to extract from input signals. What is this 
set? For example, is kicking the bucket a proper lexi- 
cal unit? The answer depends on factors external to 
the unsupervised learning framework. For the pur- 
poses of machine translation or information retrieval 
this sequence is an important idiom, but with re- 
spect to speech recognition it is unremarkable. Sim- 
ilar questions could be asked of subword units like 
syllables. Plainly the answers depend on the learning
context, and not on the signal itself.

1.1.3 The Definition of Pattern 

Any statistical definition of pattern depends on 
an underlying model. For instance, the sequence the 
dog occurs much more frequently than one would ex- 
pect given an independence assumption about let- 
ters. But for a model with knowledge of syntax 
and word probabilities, there is nothing remarkable 
about the phrase. Since all existing models have 
flaws, patterns will always appear that are artifacts 
of imperfections in the learning algorithm. 

These examples seem to imply that unsupervised 
induction will never converge to ideal grammars and 
lexicons. While there is truth to this, the rest of this 
paper describes a representation of language that 
bypasses many of the apparent difficulties. 
Figure 1: A compositional representation.

Figure 2: A coding of the first few words of a hypothetical
lexicon. The first two columns can be coded
succinctly, leaving the cost of pointers to component
words as the dominant cost of both the lexicon and
the representation of the input.



2 A Compositional Representation 



The examples in sections 1.1.1 and 1.1.2 seem to
imply that any unsupervised language learning program
that returns only one interpretation of the input
is bound to make many mistakes. And section
1.1.3 implies that decisions about linguistic
units must be made relative to their representations.
Both of these issues are addressed if linguistic units 
(for now, words in the lexicon) are built by com- 
posing other units. For example, kicking the bucket 
might be represented by the composition of kicking, 
the and bucket.[1] Of course, words that are merely
the composition of their parts are uninteresting and 
need not be included in the lexicon. The motivation 
for including a word in the lexicon must be that it 
behaves differently than its parts imply. If this is 
the case, a word is a perturbation of a composition. 

In the case of kicking the bucket the perturbation 
is one of both meaning and frequency. For scratching 
her nose the perturbation may just be of frequency.[2]
This is a very natural representation from the view- 
point of language. It correctly predicts that both 
phrases inherit their sound and syntax from their 
component words. At the same time it leaves open 
the possibility that idiosyncratic information will be 
attached to the whole, as with the meaning of kick- 
ing the bucket. This structure is very much like the 
class hierarchy of a modern programming language. 
It is not the same thing as a context-free grammar, 
since each word does not act in the same way as the 
default composition of its components. 

Figure 1 illustrates a recursive decomposition (under
concatenation) of the phrase national football
league. The phrase is broken into three words, each
of which is also decomposed in the lexicon. This
process bottoms out in the terminal characters. This
is a real decomposition achieved by a program described
later in this paper. Not shown are the perturbations
(in this case merely probability specifications)
that distinguish each word from its parts. This general
framework extends to other perturbations. For
example, the word wanna is naturally thought of as
a composition of want and to with a sound change.
And in speech the three different words to, two and
too may well inherit the sound of a common ancestor
while introducing new syntactic and semantic properties.

1 The simplest composition operator is concatenation;
later sections discuss more interesting ones.

2 Naturally, an unsupervised learning algorithm with
no access to meaning will not treat these two examples
differently.

2.1 Coding 

Of course, for this representation to be more than 
an intuition both the composition and perturbation 
operators must be exactly specified. In particular, a 
code must be designed that enables a word (or the 
input) to be expressed in terms of its parts. As a sim- 
ple example, suppose that the composition operator 
is concatenation, that terminals are characters, and 
that the only perturbation operator is the ability to 
express the probability of a word independently of 
the probability of its parts. Then to code either the 
input or a (nonterminal) word in the lexicon, the 
number of component words in the representation 
is written, followed by a code for each component 
word. Naturally, each word in the lexicon must also 
be linked to its code, and under a near-optimal cod- 
ing scheme like a Huffman code, the code length will 
be related to the probability of the word. Thus, link- 
ing a word to a code serves also to specify the word's 
probability, its only perturbation. Furthermore, if 
words are written down in order of decreasing prob- 
ability, a Huffman code for a large lexicon can be 
specified using a negligible number of bits (provid- 
ing the number of codes of each length is sufficient). 
This and the near-negligible cost of writing down the 
number of components in word representations will 
not be discussed further. Figure 2 presents a portion
of an encoding of a hypothetical lexicon under this
scheme.
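
The coding scheme above can be made concrete with a minimal Python sketch. The lexicon contents and probabilities below are invented for illustration, and the idealized length $-\log_2 p$ stands in for an actual Huffman code length; the small cost of writing the number of components is ignored, as in the text.

```python
import math

# Hypothetical lexicon: word -> (probability, components).
# Terminals (single characters) have no components.
LEXICON = {
    "t": (0.20, None),
    "h": (0.15, None),
    "e": (0.25, None),
    "the": (0.30, ["t", "h", "e"]),
    "thethe": (0.10, ["the", "the"]),
}

def code_length(word):
    """Idealized length in bits of the code (pointer) for a word."""
    return -math.log2(LEXICON[word][0])

def representation_cost(word):
    """Bits needed to write a nonterminal word as pointers to its parts."""
    _, parts = LEXICON[word]
    if parts is None:
        raise ValueError("terminals are not represented by components")
    return sum(code_length(part) for part in parts)
```

Under these assumed numbers, writing thethe costs two pointers to the, which is cheaper than the six character pointers it would otherwise need.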



2.2 MDL 

Given a coding scheme and a particular lexicon (and 
a parsing algorithm) it is in theory possible to calcu- 
late the minimum length encoding of a given input. 
Part of the encoding will be devoted to the lexicon, 
the rest to representing the input in terms of the 
lexicon. The lexicon that minimizes the combined 
description length of the lexicon and the input maximally
compresses the input. In the sense of Rissanen's
minimum description-length (MDL) principle
(Rissanen, 1978; Rissanen, 1989) this lexicon is the 
theory that best explains the data, and one can hope 
that the patterns in the lexicon reflect the underly- 
ing mechanisms and parameters of the language that 
generated the input. 
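
As a small worked illustration of the trade-off, the sketch below compares the combined description length of two hand-built lexicons for the same input. The lexicons, their probabilities, and the parses are assumptions chosen so the arithmetic is easy to follow; they are not derived by any search.

```python
import math

def bits(prob):
    """Idealized code length in bits for an event of probability prob."""
    return -math.log2(prob)

def description_length(lexicon, parse):
    """Lexicon cost plus input cost, both counted in pointer bits.

    lexicon: word -> (probability, components or None for terminals)
    parse:   list of lexicon words whose concatenation is the input
    """
    lexicon_bits = sum(
        bits(lexicon[part][0])
        for _, parts in lexicon.values()
        if parts is not None
        for part in parts
    )
    input_bits = sum(bits(lexicon[word][0]) for word in parse)
    return lexicon_bits + input_bits

text = "ab" * 8
chars_only = {"a": (0.5, None), "b": (0.5, None)}
with_ab = {"a": (0.25, None), "b": (0.25, None), "ab": (0.5, ["a", "b"])}

dl_chars = description_length(chars_only, list(text))  # 16 bits
dl_ab = description_length(with_ab, ["ab"] * 8)        # 4 + 8 = 12 bits
```

Here the lexicon containing ab pays 4 bits to define the new word but saves 8 bits on the input, so it is the preferred theory under MDL.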

2.3 Properties of the Representation 

Representing words in the lexicon as perturbations 
of compositions has a number of desirable properties. 

• The choice of composition and perturbation op- 
erators captures a particular detailed theory of 
language. They can be used, for instance, to 
reference sophisticated phonological and mor- 
phological mechanisms. 

• The length of the description of a word is a mea- 
sure of its linguistic plausibility, and can serve 
as a buffer against learning unnatural coinci- 
dences. 

• Coincidences like scratching her nose do not ex- 
clude desired structure, since they are further 
broken down into components that they inherit 
properties from. 

• Structure is shared: the words blackbird and 
blackberry can share the common substructure 
associated with black, such as its sound and 
meaning. As a consequence, data is pooled for 
estimation, and representations are compact. 

• Common irregular forms are compiled out. For 
example, if went is represented in terms of go 
(presumably to save the cost of unnecessarily 
reproducing syntactic and semantic properties) 
the complex sound change need only be repre- 
sented once, not every time went is used. 

• Since parameters (words) have compact repre- 
sentations, they are cheap from a description 
length standpoint, and many can be included 
in the lexicon. This allows learning algorithms 
to fit detailed statistical properties of the data. 

This coding scheme is very similar to that found in
popular dictionary-based compression schemes like
LZ78 (Ziv and Lempel, 1978). It is capable of compressing
a sequence of identical characters of length
$n$ to size $O(\log n)$. However, in contrast to compression
schemes like LZ78 that use deterministic rules
to add parameters to the dictionary (and do not arrive
at linguistically plausible parameters), it is possible
to perform more sophisticated searches in this
representation.
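
One way to see the logarithmic bound is a doubling construction, sketched below in Python. The function name and the greedy parse are illustrative assumptions rather than the procedure used in the paper: each lexicon word is the concatenation of the previous word with itself, so a run of $n$ characters needs only about $\log_2 n$ words and a similar number of pointers.

```python
def run_decomposition(n):
    """Return (lexicon, parse) for a run of n >= 1 copies of 'a'.

    lexicon maps each word to its two components (None for the terminal);
    parse is a list of lexicon words whose concatenation is the run.
    """
    lexicon = {"a": None}
    word = "a"
    while 2 * len(word) <= n:
        doubled = word + word
        lexicon[doubled] = [word, word]
        word = doubled

    # Greedy parse, longest words first: this mirrors the binary
    # expansion of n, so each word is used at most once.
    parse, remaining = [], n
    for candidate in sorted(lexicon, key=len, reverse=True):
        if len(candidate) <= remaining:
            parse.append(candidate)
            remaining -= len(candidate)
    return lexicon, parse
```

For a run of 1000 characters this yields a ten-word lexicon and a six-word parse, so the total number of pointers written grows with the number of binary digits of $n$ rather than with $n$ itself.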

3 A Search Algorithm 

Since the class of possible lexicons is infinite, the
minimization of description length is necessarily inexact
and heuristic. Given a fixed lexicon, the
expectation-maximization algorithm (Dempster et
al., 1977) can be used to arrive at a (locally) optimal
set of probabilities and codelengths for the
words in the lexicon. For composition by concatenation,
the algorithm reduces to the special case of
the Baum-Welch procedure (Baum et al., 1970) discussed
in (Deligne and Bimbot, 1995). In general,
however, the parsing and re-estimation involved in
EM can be considerably more complicated. To update
the structure of the lexicon, words can be added
or deleted from it if this is predicted to reduce the
description length of the input. This algorithm is
summarized in figure 3.[3]



Start with lexicon of terminals.
Iterate
    Iterate (EM)
        Parse input and words using current lexicon.
        Use word counts to update probabilities.
    Add words to the lexicon.
    Iterate (EM)
        Parse input and words using current lexicon.
        Use word counts to update probabilities.
    Delete words from the lexicon.

Figure 3: An iterative search algorithm. Two it- 
erations of the inner loops are usually sufficient for 
convergence, and for the tests described in this pa- 
per after 10 iterations of the outer loop there is little 
change in the lexicon in terms of either compression 
performance or structure. This algorithm is quite 
practical for the sizes of problems presented in this 
paper. 
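
The inner parse-and-update step can be sketched for the concatenation operator as follows. This is a simplified illustration under stated assumptions, not the implementation used for the experiments: it uses a single (Viterbi) parse instead of summing over parses, it parses only the input and not the words of the lexicon, and the `floor` parameter is an invented device that keeps unused words available.

```python
import math
from collections import Counter

def viterbi_parse(text, probs):
    """Return (words, bits): the cheapest segmentation of text into
    lexicon words, where a word with probability p costs -log2(p) bits."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: fewest bits to encode text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word ending at i
    max_len = max(len(word) for word in probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs and not math.isinf(best[start]):
                cost = best[start] - math.log2(probs[word])
                if cost < best[end]:
                    best[end], back[end] = cost, start
    if math.isinf(best[n]):
        raise ValueError("text cannot be parsed with this lexicon")
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1], best[n]

def reestimate(text, probs, floor=1e-3):
    """One Viterbi re-estimation step: parse, count words, renormalize.
    Unused words keep a small floor count (an assumption of this sketch)."""
    words, _ = viterbi_parse(text, probs)
    counts = Counter(words)
    weights = {word: max(counts[word], floor) for word in probs}
    total = sum(weights.values())
    return {word: weights[word] / total for word in probs}
```

For example, with probabilities `{"a": 0.4, "b": 0.4, "ab": 0.2}` the input `abab` is parsed as two copies of `ab`, and one re-estimation step shifts nearly all probability mass onto `ab`.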



3.1 Adding and Deleting Words 

For words to be added to the lexicon, two things are
needed. The first is a means of hypothesizing candidate
new words. The second is a means of evaluating
candidates. One reasonable means of generating
candidates is to look at pairs (or bigger tuples)
of words that are composed in the parses of words
and the input. So long as the composition operator
is associative, a new word can be created from
such a pair and substituted in place of it wherever
it appears. For example, if water and melon are frequently
composed, then a good candidate for a new
word is water $\circ$ melon = watermelon, where $\circ$ is the
composition operator. In order to evaluate whether
the addition of such a new word is likely to reduce
the description length of the input, it is necessary
to record during the EM step posterior counts $c(W)$
for each composed word pair $W = w_1 \circ w_2$.

3 For the composition operators and test sets we have
looked at, using single (Viterbi) parses produces almost
exactly the same results (in terms of both compression
and lexical structure) as summing probabilities over
multiple parses.

The effect on the description length of adding a
new word cannot be exactly computed. Its addition
will not only affect the counts of other words, but
may also cause other words to be added or deleted.
Fortunately, simple approximations of the change
are adequate for evaluating word candidates. For
example, if Viterbi analyses are being used then the
new word $W$ (if worth adding at all) will completely
replace all compositions of $w_1$ and $w_2$, though each
of these words will be used once in the representation
of $W$. Therefore, if $c(w)$ is the count of a word $w$ before
$W$ is added to the lexicon, and $c'(w)$ the count
after, then under the assumption that otherwise
parses are stable across the change, $c'(W) = c(W)$,
$c'(w_1) = c(w_1) - c(W) + 1$, $c'(w_2) = c(w_2) - c(W) + 1$
and otherwise $c'(w) = c(w)$. Of course, all word
probabilities change because of the change in total
word count. Since the codelength of a word $w$ with
probability $p(w)$ is approximately $-\log p(w)$, the estimated
total change in description length caused by
adding a new word $W$ to a lexicon $L$ is

$$\Delta \approx -c'(W) \log p'(W) + \mathrm{d.l.}(\mathrm{changes}) + \sum_{w \in L} \bigl( -c'(w) \log p'(w) + c(w) \log p(w) \bigr)$$

where d.l.(changes) represents the cost of writing
down the perturbations involved in the representation
of $W$.[4] This can be computed quite efficiently.
If $\Delta < 0$ the word $W$ is predicted to reduce the total
description length and is added to the lexicon.
In our implementation, all candidates with negative
$\Delta$ are added simultaneously; subsequent delete steps
can fix mistakes.
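
The estimate can be sketched in a few lines of Python. The function and argument names are hypothetical, probabilities are taken to be relative counts, logarithms are base 2, and `change_bits` stands in for the d.l.(changes) term; the more elaborate approximations mentioned in the footnote are not modeled.

```python
import math

def estimated_delta(counts, w1, w2, pair_count, change_bits=0.0):
    """Estimated change in description length (bits) from adding the
    word W = w1 o w2, given current word counts and the count of the pair."""
    old_total = sum(counts.values())

    # c'(w1) = c(w1) - c(W) + 1, and likewise for w2; other counts unchanged.
    new_counts = dict(counts)
    for part in (w1, w2):
        new_counts[part] -= pair_count - 1
    new_counts[(w1, w2)] = pair_count          # c'(W) = c(W)
    new_total = sum(new_counts.values())

    def total_bits(count, total):
        # -c log2(c / total); words that are never used cost nothing.
        return -count * math.log2(count / total) if count > 0 else 0.0

    new_bits = sum(total_bits(c, new_total) for c in new_counts.values())
    old_bits = sum(total_bits(c, old_total) for c in counts.values())
    return new_bits - old_bits + change_bits
```

With counts of 10 for both a and b, a pair that accounts for every occurrence gives a negative estimate (the word is added), while a pair seen only once gives a positive estimate (the word is rejected).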

Similar heuristic approximations can be used to 
estimate the benefit of deleting words. In that case, 
a reasonable assumption is that if a word is deleted 
its representation replaces it everywhere. Again this 
is not necessarily correct, but serves adequately. 



3.2 Search Properties 



(de Marcken, 1995a; Pereira ...; Carroll and Charniak, 1992).



4 See (de Marcken, 1995b) for more detailed discussion
of approximations. The actual schemes used in the
tests discussed in this paper are slightly more complicated
than those presented here. For example, it is not
assumed that the representation of $W$ after the change
will necessarily be $w_1 \circ w_2$, and the possibility that either
or both of $w_1$ and $w_2$ will subsequently be deleted
is considered. Further, unless Viterbi analyses are being
used, $c'(W)$ is not assumed to be exactly $c(W)$.



The search algorithm described above generally escapes
this problem, in large part because of the underlying
representation. The reason is that hidden
structure is largely a "compile-time" phenomenon.
During parsing all that is important about a word is 
its surface form and codelength. The internal rep- 
resentation does not matter. Therefore, the internal 
representation is free to reorganize at any time; it 
has been decoupled. This allows structure to be built 
bottom up or for structure to emerge inside already 
existing parameters. Furthermore, since parameters 
(words) encode surface patterns, their use is constrained
and they tend not to have competing roles, in
contrast, for instance, to hidden nodes in neural networks.
And since the number of parameters is not
fixed, when words do start to have multiple conflict- 
ing roles, they can be split with common substruc- 
ture shared. Finally, since add and delete cycles can 
compensate for initial mistakes, inexact heuristics 
can be used for adding and deleting words. 

4 Concatenation Results 

The simplest reasonable instantiation of the 
composition-and-perturbation framework is with the 
concatenation operator and probability perturba- 
tion. This instantiation has been tested on problems 
of text segmentation and compression. Given a text 
document, the search algorithm tries to find the lex- 
icon that minimizes total description length. For 
testing purposes, delimiters like spaces and punctu- 
ation are removed from the input. Define true words 
to be minimal character sequences bordered by de- 
limiters in the original input. Since the search algo- 
rithm parses the input as it compresses it, it can out- 
put the optimal segmentation of the input into words 
drawn from the lexicon. These words are themselves 
decomposed in the lexicon, and can be considered 
to form a tree that terminates in characters. This 
tree can have no more than $O(n)$ nodes for an input
of length $n$, even though there are $O(n^2)$ possible
true words in such an input; thus, the segmentation
tree contains considerable information. Define
recall to be the percentage of true words that occur
at some level of the segmentation tree. Define
crossing-brackets to be the percentage of true words
that violate the segmentation tree structure.[5]
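
The two measures can be sketched as follows, assuming that both true words and tree nodes are represented as (start, end) character spans; that representation and the function names are choices made for this illustration. The example reproduces the footnoted case of the input the moon with the partial tree [[them] [o] [on]].

```python
def crosses(span, node):
    """True if the two spans overlap without either containing the other."""
    (s1, e1), (s2, e2) = span, node
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def recall(true_spans, tree_spans):
    """Percentage of true words that appear as some node of the tree."""
    hits = sum(span in tree_spans for span in true_spans)
    return 100.0 * hits / len(true_spans)

def crossing_brackets(true_spans, tree_spans):
    """Percentage of true words that cross at least one tree node."""
    bad = sum(any(crosses(span, node) for node in tree_spans)
              for span in true_spans)
    return 100.0 * bad / len(true_spans)

# Input "themoon": true words are "the" (0, 3) and "moon" (3, 7).
true_words = [(0, 3), (3, 7)]
# Tree nodes: the root, "them", "o" and "on".
tree = {(0, 7), (0, 4), (4, 5), (5, 7)}

example_recall = recall(true_words, tree)                # 0.0
example_crossing = crossing_brackets(true_words, tree)   # 50.0
```

In this example the is a recall violation only, since it nests inside them, whereas moon crosses them and counts against both measures.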

The algorithm was applied to two texts, a lowercase
version of the million-word Brown corpus
with spaces and punctuation removed, and 4 million
characters of Chinese news articles in a two-byte/character
format. In the case of the Chinese,
which contains no inherent separators like spaces,
segmentation performance is measured relative to
another computer segmentation program that had
access to a (human-created) lexicon. The algorithm
was given the raw encoding and had to deduce the
internal two-byte structure. In the case of the Brown
corpus, word recall was 90.5% and crossing-brackets
was 1.7%. For the Chinese, word recall was 96.9%
and crossing-brackets was 1.3%. In the case of both
English and Chinese, most of the recall violations
were words that occurred only once in the corpus.
Thus, the algorithm did an extremely good job of
learning words and properly using them to segment
the input. Furthermore, the crossing-bracket measure
indicates that the algorithm makes very few
clear mistakes. Of course, the hierarchical lexical
representation does not make a commitment to what
levels are "true words" and which are not; about five
times more nodes exist in the segmentation tree than
true words. Experiments described later in this paper
demonstrate that for most applications this excess
structure is not only not a problem, but desirable.
Figure 4 displays some of the lexicon learned from
the Brown corpus.

5 The true word moon in the input the moon is a
crossing-bracket violation of them in the (partial)
segmentation tree [[them] [o] [on]].

The algorithm was also run as a compressor 
on a lower-case version of the Brown corpus with 
spaces and punctuation left in. All bits neces- 
sary for exactly reproducing the input were counted. 
Compression performance is 2.12 bits/char, signifi- 
cantly lower than popular algorithms like gzip (2.95 
bits/char). This is the best text compression result 
on this corpus that we are aware of, and should not
be confused with lower figures (Brown et al., 1992)
that do not include the cost of parameters. Furthermore,
because the compressed text is stored in terms
of linguistic units like words, it can be searched, in- 
dexed, and parsed without decompression. 



Rank    Word

0       [s]
1       [the]
2       [and]
3       [a]
4       [of]
5       [in]
6       [to]

500     [students]
501     [material]
502     [urn]
503     [words]
504     [period]
505     [class]
506     [question]

5000    [[ing] [them]]
5001    [[mon] [k]]
5002    [[re] [lax]]
5003    [[rig] [id]]
5004    [[connect] [ed]]
5005    [[i] [k]]
5006    [[hu] [t]]

26000   [[pleural] [blood] [supply]]
26001   [[anordinary] [happy] [family]]
26002   [[f] [eas] [ibility] [of]]
26003   [[lunar] [brightness] [distribution]]
26004   [[primarily] [diff] [using]]
26005   [[sodium] [tri] [polyphosphate]]
26006   [[charcoal] [broil] [ed]]

Figure 4: Sections of a 26,027 word lexicon learned
from the Brown corpus, ranked by frequency. The
words in the less-frequent half are listed with
their first-level decomposition. Word 5000 causes
crossing-bracket violations, and words 26002 and
26006 have internal structure that causes recall
violations.



5 Learning Meanings 

Unsupervised learning algorithms are rarely used in 
isolation. The goal of this work has been to ex- 
plain how linguistic units like words can be learned, 
so that other processes can make use of these 
units. In this section a means of learning the map- 
pings between words and artificial representations 
of meanings is described. The composition-and- 
perturbation representation handles this application 
neatly. 

Imagine that text utterances are paired with representations
of meaning,[6] and that the goal is to find
the minimum-length description of both the text and
the meaning. If there is mutual information between
the meaning and text portions of the input, then
better compression is achieved if the two streams
are compressed simultaneously than independently.
If a text word has an associated meaning, then writing
down that word to account for some portion of
text also accounts for some portion of the meaning
of that text. The remaining meaning can be written
down more succinctly. Thus, there is an incentive
to associate meaning with sound, although of course
the association pays a price in the description of the
lexicon.

6 This framework is easily extended to handle multiple
ambiguous meanings (with and without priors) and
noise, but these extensions are not discussed here.

Although it is obviously a naive simplification,
many of the interesting properties of the compositional
representation surface even when meanings
are treated as sets of arbitrary symbols. A word is
now both a character sequence and a set of meaning
symbols. The composition operator concatenates
the characters of its operands and takes the union
of their meaning symbols. Of course, there must
be some way to perturb the default meaning of a
word. One way to do this is to explicitly write out
any symbols that are present in the word's meaning
but not in its components, or vice versa. Thus, the
word red {RED} might be represented as
$r \circ e \circ d$ + RED. Given an existing word berry {BERRY},
the red berry cranberry {RED BERRY} can be represented
$c \circ r \circ a \circ n \circ$ berry {BERRY} + RED.
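
The composition and perturbation of meanings described above can be sketched directly. The `Word` class and `compose` function are illustrative names, and the perturbation is modeled as a symmetric difference so that one written symbol can either add to or remove from the default meaning; this is one reading of the text, not necessarily the author's encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    chars: str
    meaning: frozenset

def terminal(char):
    """A single character with no meaning symbols."""
    return Word(char, frozenset())

def compose(parts, perturbation=frozenset()):
    """Concatenate characters and take the union of meanings, then flip
    every symbol listed in the perturbation (add if absent, drop if present)."""
    chars = "".join(part.chars for part in parts)
    default = frozenset().union(*(part.meaning for part in parts))
    return Word(chars, default ^ frozenset(perturbation))

red = compose([terminal(c) for c in "red"], {"RED"})
berry = compose([terminal(c) for c in "berry"], {"BERRY"})
cranberry = compose([terminal(c) for c in "cran"] + [berry], {"RED"})
```

In this sketch cranberry inherits BERRY from its component berry and only RED has to be written explicitly, which is where the saving in description length comes from.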

5.1 Results 

To test the algorithm's ability to infer word meanings,
10,000 utterances from an unsegmented textual
database of mothers' speech to children were paired
with representations of meaning, constructed by assigning
a unique symbol to each root word in the vocabulary.
For example, the sentence andwhatishepaintingapictureof
is paired with the unordered meaning
{AND WHAT BE HE PAINT A PICTURE OF}.[7]
In the first experiment, the algorithm received
these pairs with no noise or ambiguity, using
a perturbation operator such that each symbol's cost
was 10 bits. After 8 iterations of training on the text
portion of the input and then a further 8 iterations
of training on both the text and the meaning, the
text was parsed again. The meanings of the resulting
word sequences (as defined by the lexicon) were
compared with the true meaning of the input. Symbol
accuracy was 98.9%, recall was 93.6%. Used to
identify the true meaning from among the meanings
of the previous 20 sentences, the program selected
correctly 89.1% of the time, or ranked the true meaning
tied for first 10.8% of the time.

A second test was performed in which during 
training the algorithm received three possible mean- 
ings for each utterance, the true one and also the 
meanings of the two surrounding utterances. A uni- 
form prior was used. Despite the ambiguity, during 
testing symbol accuracy was again 98.9%, recall was 
75.3%. 

The final lexicon includes extended phrases, but 
meanings tend to filter down to the proper level. For 
instance, although the words duck, ducks, theducks 
and duckdrink are all in the lexicon and contain the 
meaning DUCK, the symbol is only written once, 
in the description of duck. All other words inherit
the symbol from this word. Similar results hold for 
similar experiments on the Brown corpus. For ex- 
ample, scratching her nose inherits its meaning com- 
pletely from its parts, while kicking the bucket does 
not. This is exactly the result argued for in the mo- 
tivation section of this paper, and illustrates why in 
our framework there is little harm in occasionally 
adding unnecessary words like scratching her nose 
to the lexicon. 

6 Other Extensions 

We have performed other experiments using this representation
and search algorithm, on tasks in unsupervised
learning from speech and grammar induction.
Figure 5 contains a small portion of a
lexicon learned from 55,000 utterances of continuous
speech by multiple speakers. The utterances
are taken from dictated Wall Street Journal articles.
The concatenation operator was used with phonemes
as terminals. A second layer was added to the framework
to map from phonemes to speech; these extensions
are described in more detail in (de Marcken,
1995b). The sound models for the phonemes were estimated
independently on a separate corpus of hand-segmented
speech. Although the phoneme models
are extremely poor, many words are recognizable,
and this is the first significant lexicon learned directly
from spoken speech without supervision.

7 The unordered nature of the second data stream
greatly increases the complexity of the EM algorithm,
which can no longer be implemented efficiently through
dynamic programming. Although too complex to be discussed
here, in our implementation a factorial approximation
is used to succinctly and efficiently represent forward
and backward probabilities.

If the composition operator makes use of context,
then this framework extends naturally to a variation
of stochastic context-free grammars in which
composition corresponds to tree substitution and
the inside-outside algorithm (Baker, 1979) is used
for re-estimation. In particular, if each word is associated
with a parent class, and these classes are
permissible terminals, then "words" act as production
rules. For example, a possible word with class
vp is $[_{vp}\ \text{take off}\ \langle np \rangle]$, which can be represented by
$[_{vp}\ \langle v \rangle \langle p \rangle \langle np \rangle] \circ [_{v}\ \text{take}] \circ [_{p}\ \text{off}] \circ [_{np}\ \ast]$, where $\ast$ is
a special symbol that indicates a class is not expanded.
Furthermore, $[_{vp}\ \langle v \rangle \langle p \rangle \langle np \rangle]$ may be decomposed
into $[_{vp}\ \langle v \rangle \langle pp \rangle] \circ [_{v}\ \ast] \circ [_{pp}\ \langle p \rangle \langle np \rangle]$. In
this way syntactic structure emerges in the internal
representation of relatively flat production rules.
This framework offers the significant advantage that
non-independent rule expansions can be accounted
for without sacrificing structure. We are currently
looking at various methods for automatically acquiring
classes; in initial experiments some of the first
classes learned from text are the class of vowels, of
consonants, and of verb endings.

7 Conclusions 

No previous unsupervised language-learning proce- 
dure has produced structures that match so closely 
with linguistic intuitions. We take this as a vindi- 
cation of the perturbation-of-compositions represen- 
tation. Its ability to capture the statistical and lin- 
guistic idiosyncrasies of large structures without sac- 
rificing the obvious regularities within them makes it 
a valuable tool for a wide variety of induction prob- 
lems. 

This research was supported in part by NSF 
grant 9217041-ASC and ARPA under the HPCC and 
AASERT programs. 

Figure 5: Some words from a lexicon learned from
55,000 utterances of continuous, dictated Wall Street
Journal articles. Although many words are little
more than random gibberish, words representing
million dollars, Goldman-Sachs, thousand, etc. are
learned. Furthermore, as word 8950 (long time)
demonstrates, they are often properly decomposed
into components.



